server: bench: minor fixes #10765

Merged 2 commits into master on Jan 2, 2025
Conversation

@phymbert (Collaborator) commented Dec 10, 2024

Context

After a nice exchange with @ngxson, this is a minor change to the current server bench framework to refresh it a bit. The longer-term target, still to be assessed, is to replace k6/xk6-sse with a Python-based tool such as Locust.

Changes

  • support the OpenAI streaming standard output terminated by [DONE]\n\n (see the sketch after this list)
  • export k6 raw results in CSV
  • fix too many idle TCP connections in tcp_wait
  • add a metric for the time to emit the first token
  • wait for the server to be ready in the CI script
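
For reference, a minimal Python sketch (not part of the bench harness; the URL and payload are assumed for illustration) of how an OpenAI-style SSE stream is terminated by a final data: [DONE] event and how the time to emit the first token can be measured:

    import json
    import time

    import requests  # assumed dependency, for illustration only

    def stream_completion(url, payload):
        start = time.time()
        first_token_s = None
        with requests.post(url, json=payload, stream=True) as resp:
            for line in resp.iter_lines(decode_unicode=True):
                # SSE data lines look like "data: {...}"; skip blanks and keep-alives
                if not line or not line.startswith("data: "):
                    continue
                data = line[len("data: "):]
                if data == "[DONE]":  # OpenAI streaming terminator
                    break
                chunk = json.loads(data)
                if first_token_s is None and chunk.get("choices"):
                    first_token_s = time.time() - start  # time to emit first token
        return first_token_s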

Tests (phi2 on RTX 3050)

LLAMA_SERVER_BIN_PATH=../../../cmake-build-debug/bin/llama-server python bench.py \
              --runner-label local \
              --name local \
              --branch `git rev-parse --abbrev-ref HEAD` \
              --commit `git rev-parse HEAD` \
              --scenario script.js \
              --duration 5m \
              --hf-repo ggml-org/models \
              --hf-file phi-2/ggml-model-q4_0.gguf \
              --model-path-prefix models \
              --parallel 4 \
              -ngl 33 \
              --batch-size 2048 \
              --ubatch-size 256 \
              --ctx-size 4096 \
              --n-prompts 200 \
              --max-prompt-tokens 256 \
              --max-tokens 256
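
Note: the raw k6 results mentioned in the changes are exported in CSV; when running k6 directly, this corresponds to its built-in CSV output, e.g. (file name illustrative):

    k6 run --out csv=raw_results.csv script.js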

Results:

srv  update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200

     ✓ success completion

     checks.....................................: 100.00% 165 out of 165
     data_received..............................: 4.5 MB  15 kB/s
     data_sent..................................: 96 kB   306 B/s
     dropped_iterations.........................: 35      0.111853/s
     http_req_duration..........................: avg=7.1s       min=794.72ms   med=4.13s      max=30.06s     p(90)=17.87s     p(95)=26.41s    
     http_req_sending...........................: avg=4.82ms     min=2.38ms     med=3.71ms     max=22.05ms    p(90)=7.24ms     p(95)=11.5ms    
     http_reqs..................................: 165     0.527306/s
     iteration_duration.........................: avg=7.4s       min=1.09s      med=4.43s      max=30.36s     p(90)=18.17s     p(95)=26.71s    
     iterations.................................: 165     0.527306/s
     llamacpp_completion_tokens.................: avg=126.915152 min=12         med=73         max=512        p(90)=327.6      p(95)=462.2     
     llamacpp_completion_tokens_total_counter...: 20941   66.923093/s
     llamacpp_completions_stop_rate.............: 95.75%  158 out of 165
   ✓ llamacpp_completions_truncated_rate........: 4.24%   7 out of 165
     llamacpp_emit_first_token_second...........: avg=0.161764   min=0.081      med=0.135      max=0.673      p(90)=0.2592     p(95)=0.3168    
     llamacpp_prompt_processing_second..........: avg=575.677944 min=100.149477 med=575.471698 max=877.862595 p(90)=724.111718 p(95)=761.987131
     llamacpp_prompt_tokens.....................: avg=91.824242  min=57         med=71         max=473        p(90)=148        p(95)=215.6     
     llamacpp_prompt_tokens_total_counter.......: 15151   48.419454/s
     llamacpp_tokens_second.....................: avg=18.933054  min=16.649324  med=18.722467  max=22.551929  p(90)=20.668266  p(95)=21.073767 
     sse_event..................................: 21270   67.974509/s
     vus........................................: 1       min=1          max=4
     vus_max....................................: 4       min=4          max=4


running (5m12.9s), 0/4 VUs, 165 complete and 0 interrupted iterations
default ✗ [==============================>-------] 4 VUs  5m12.9s/5m0s  165/200 shared iters
bench: shutting down server pid=7822 ...

@phymbert added the performance (Speed related topics) and server labels Dec 10, 2024
@github-actions bot added the examples and python (python script changes) labels Dec 10, 2024
- fix the case where Prometheus is not started
- wait for the server to be ready before starting the bench
@phymbert removed the examples and python (python script changes) labels Dec 27, 2024
@phymbert marked this pull request as ready for review December 27, 2024 10:11
@phymbert requested a review from ngxson as a code owner December 27, 2024 10:11
@ngxson (Collaborator) left a comment

I don't have the hardware to test, but LGTM.

Though, I'm looking forward to migrating to a Python solution like Locust (as mentioned in the PR description). That could simplify the installation process a lot while giving much more flexibility in the script (ideally, we would only need a single bench.py script in the future that can do everything at once).
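
For context, a minimal Locust sketch of what such a Python-based scenario might look like (endpoint and payload are illustrative, not an agreed design):

    from locust import HttpUser, task, between

    class CompletionUser(HttpUser):
        wait_time = between(0.1, 1.0)  # think time between iterations

        @task
        def chat_completion(self):
            # illustrative request; a real scenario would mirror script.js
            self.client.post("/v1/chat/completions", json={
                "model": "model",
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 256,
            })

It would be run with something like locust -f bench_locust.py --host http://localhost:8080, with the file name and host as placeholders.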

@@ -89,6 +90,9 @@ export default function () {
],
"model": model,
"stream": true,
"stream_options": {
"include_usage": true, // False to be supported in llama.cpp server
A collaborator replied on this diff:

Not sure what you mean here, but in llama.cpp we ignore include_usage and always include the usage info.
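
For reference, when usage info is included, it typically arrives as a final chunk right before the stream terminator, along these lines (field values illustrative):

    data: {"choices":[],"usage":{"prompt_tokens":92,"completion_tokens":127,"total_tokens":219}}

    data: [DONE]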

@phymbert merged commit 2f0ee84 into master Jan 2, 2025
9 checks passed
@phymbert deleted the phymbert/server/bench/fix-streaming branch January 2, 2025 17:06